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1. Summary 

Problem Statement 

In this Task we evaluate the feasibility of using the hypercube-connected 
concurrent processor systems for problems in Computational Fluid 
Dynamics(CFD). We have found that concurrent processor systems can be a 
cost effective approach to CFD. 

A eeompliahmenta 

We have designed and implemented a Lax-Wendroff explicit method for the 
Navier-Stokes equations. Our code runs on the Intel iPSC concurrent processor 
system. Tests of this code show that it is reasonably efficient. This is so 
because the computation dominates the communication as the size of the prob- 
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lem domain increases. On a 32 node hypercube we obtained a sustained process- 
ing rate (including the cost of communication) of over 475,000 floating point 
operations per second. Comparison of the identical code on a VAX 11/780 (one 
node) shows that the cube achieves a floating point operation rate that is about 
15 times faster than the VAX. 

We have also designed and implemented the Beam and Warming implicit 
factored method for Berger’s equation. This code has been run on the Intel iPSC 
hypercube. Preliminary tests show that the efficiency of this code is poor. The 
reason for this stems from the fact that a substantial amount of communication 
is required throughout the computation. The efficiency would no doubt improve 
with the Navier-Stokes equations since the amount of computation relative to 
communication increases. Additionally, there is little chance to overlap the com- 
munication with the computation. Our implementation indicates that improve- 
ments in the communication architecture would improve the efficiency of our 
implementation. 

On the Intel iPSC there is a large overhead for communication between 
nodes and that large numbers of small messages can seriously impact the effi- 
ciency of a computation[KO]. There are a number of issues concerning the Intel 
communication architecture. 

1. There is no separate I/O processor in the node. The communica- 
tion coprocessors in the Intel node are not programmable. 


m 


r j 
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2. Too much data copying is required for I/O. 

3. Processes in the same node cannot share address space. The 
above factors make it impossible to reduce the communication 
penalty by overlapping communication with computation. 

The 512K bytes of main memory is inadequate for CFD. A version of the 
Intel iSPC is available with 4.5 Mbytes of main memory per node and this confi- 
guration would be significantly more useful in CFD experiments. 

Projections 

1. It is important to continue the evaluation of the CFD codes that 
have been developed especially on a hypercube with substantially 
more node memory. The Navier Stokes equations should be 
implemented using the Beam and Warming implicit factored 
method. 

2. For the sake of completeness, a spectral method should be imple- 
mented. Much would be learned from this project. 

3. Codes developed at Ames should be ported to other concurrent 
processor systems such as the JPL Mark III concurrent processor 
system. Since this system supports overlapped I/O, the efficiency 
of our implementations should improve. 
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4. A study into the feasibility of using the hypercube concurrent 
processor system for grid generation should be undertaken. 
Careful consideration should be given to graphical input and out- 
put, With the grid generation component in place, the previous 
codes should be expanded to include metric terms. This 'will help 
to improve efficiency. 

5. Once the basic methods for CFD have been implemented and 
tested, zonal methods should be investigated. 
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2. Introduction 

In this report we present the results of an investigation into the feasibility of 
using hypercube concurrent processor systems for problems in Computational 
Fluid Dynamics(CFD). Specifically, we are interested in the numerical simula- 
tion of the Navier-Stokes equations for situations in which few simplifying 
assumptions can be made and for high Reynolds number. If we insist on a direct 
simulation of the Navier-Stokes equations for any practical problem, an enor- 
mous number of grid points is required to resolve the wide range of scales of 
motion. An alternative to numerically resolving all scales of motion is to model 
those scales not resolvable by the computational grid in terms of the resolvable 
scales. This technique is referred to as Large Eddy Simulation(LES) and is also 
computationally intensive. In either case we are faced with problems whose com- 
putational demands in time and space are beyond the capacity of today’s super- 
computers. Thus we are motivated to consider multiprocessor architectures with 
the hope that they may be able to cope with the demands of CFD. 

Fine- Grain Parallelism 

When computational requirements outpace current technology, we often 
turn to (parallel) concurrent architectures. There has been a great deal of effort 
directed at speeding up single-processor systems using fine-grain parallelism. For 
example, pipeline techniques have been applied to instruction processing and to 
the construction of arithmetic units. Multiple memory banks are used to obtain 


TR-88.7 


March 18, 1988 


* 7 • 


adequate memory bandwidth to keep up with today’s central processors. The 
supercomputers offered by manufacturers such as CRAY and CDC represent the 
ultimate in this direction, 

An important feature of these systems is that programmers are shielded 
from the architectural complexities. The architectural features are usually hid- 
den by the compiler technology which attempts to generate code that takes 
advantage of the architecture. For the most part, scientists have been able to 
maintain the abstraction provided by some programming language, such as 
FORTRAN. Often, better results are obtained if the high level code is written 
with the compiler optimizations in mind (such as writing DO loops which are 
easy for the compiler to "vectorize”) and additionally, one could write low-level 
code which makes more effective use of the hardware. Many subroutine libraries 
have been created using this latter approach. However, the vast majority of the 
users of these systems are not overly concerned with their inherent complexities. 
This is not true for the following classes of concurrent computer systems. 

Shared Memory • 

Multiprocessor systems with shared memory have been designed to help 
meet the challenge, but these systems are most often used to provide higher 
throughput rather than true parallel computation for a single instance of a prob- 
lem. The reason for this is that we are not yet able to hide this kind of parallel- 
ism from the programmer. The programmer must explicitly design and imple- 
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ment algorithms which take advantage of the multiple processors since compiler 
technology has not been designed which can effectively shield the programmer 
form the hardware environment, (An important research effort in the area of 
automatic program restructuring is the Cedar Project at the University of Illi- 
nois, Its purpose is to provide software which will automatically detect instances 
where concurrent processors could be used effectively, Generally speaking, this 
type of compiler technology is unavailable and we are forced to design parallel 
algorithms which take explicit account of the concurrent hardware or find paral- 
lel implementations for known sequential algorithms.) 

Shared-memory systems employ mechanisms which permit the processors to 
access shared memory. These mechanisms can be as simple as common bus or as 
complicated as a crossbar switch, and generally we expect to find something in 
between. Bus architectures suffer from too little bandwidth, for which they usu- 
ally compensate with mechanisms such as processor caches. On the other hand, 
a full crossbar switch is prohibitive in hardware and expense, Other approaches 
trade interconnection complexity with ease and cost of construction. 

Message Passing 

Another important class of concurrent processor systems is the one in which 
there is no shared memory. The processors communicate with each other by 
sending messages (control and data) through a network. There is a wide range of 
possible configurations depending on processor and network complexity. 


TR-Bfl.7 


• 9 . 


March 18, 1989 


Regardless of the particular level of node complexify and the choice of intercon- 
nection network, the programmer is faced with the task of designing and imple- 
menting algorithms which take the system architecture into account. In general, 
the problem domain must be partitioned among th* processors and the program- 
mer must specify the details of the corresponding data communication. 

In this report we study the feasibility of using concurrent processor systems 
with no shared memory for problems in CFD. We shall consider implementa- 
tions of explicit and implicit numerical methods for the Navier-Stokes equations. 

In the next section w* describe some of the features of concurrent processor 
systems. After this we present the equations of interest, the Navier-Stokes equa- 
tions, and give two numerical methods for their simulation. Finally we describe 
implementations of these methods on an Intel iPSC 32-node hypercube. 
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3. Process* 

We define a concurrent processor system as a collection of nodes, where each 
node consists of a processor, arithmetic unit(s), local memory, and an intercon- 
nection network which provides the means through which nodes communicate 
with each other, Each node executes its own instruction stream and may have 
access to external storage media and I/O devices, Since there is no shared 
memory, the only way for processors to communicate with each other is by send- 
ing messages through the network. There are many kinds of interconnection net- 
works and we breifly describe a few of the more important ones. Throughout th 
rest of the paper we will let N denote the number of nodes in the system. The 
nodes are numbered beginning with 0 and ranging up to N - 1. 


3.1. Interconnection Architectures 

Ring: We say the nodes are connected in a ring if there is a communiation link 
from node i to node i +1 {modN ) for i -1, The ring architecture has 

the advantage that it is very simple and obviously scalable!. Interestingly 
enough there are many problems for which the ring architecture is sufficient. 
Unfortunately, there are also many problems for which the ring architecture is 
not rich enough in the sense that the communication overhead caused by large 
distances between the nodes is more than we can tolerate. For example, imple- 


t A deiifn U scalable If it can be adjusted up or down in iiee without lose of efficiency or functionality. 
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mentations of the FFT algorithm on a ring connected systems suffer from exces* 
sive communication overhead. 

Mt$h; The 2-dimensional mesh architecture represents an attempt to minimise 
the communication distance between the nodes. In this case we envision the 
nodes set out in a two-dimensional array with horizontal and vertical communi- 
cation links between adjacent nodes. 

Hyptrcube: Assume that N « 2 k . Let bin (i ,k ) denote the k-bit binary 
representation of the integer i where 0<i' <2* , The hyptrcube interconnection 
architecture provides a direct communication link between node i and node j if 
bin ( i ,k ) and bin (j ,k) differ in exactly one bit position. It follows that each 
node has a direct communication link with k other nodes and that the maximum 
number of communication links that is needed for any pair of nodes to communi- 
cate is k . For example, consider nodes 0 and 2* -1. Clearly, bin [0,k ) and 
bin (2* — 1 ,Ar ) differ in all n bit positions and therefore any message sent from 
node 0 to node 2 4 -1 must pass through k - 1 intermediate nodes and use at 
least k different communication links. The hypercube network contains a total 

of ~ N log 2 N bidirectional communication links. 







- - 
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HypdRxuae 


Butterfly: The butterfly network is frequently associated with the FFT algorithm, 
A butterfly network consists of k +1 ranks of network nodes. We denote the i th 
node on the r th rank by p„- for O^i <N and 0<r ^k . Then node p r ; on rank 
r >0 is connected to two nodes on rank r -1, the two nodes p r _j ; such that 
either j =i or the binary representation of j differs from i in only the r th placi 
from the left. 


hse 


h =. 1 


h = Z 


r- 3 
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Usually, the r =0 and the r =k ranks are identified and correspond to the pro- 
cessor nodes and the remaining ranks contain network switching nodes. 

The butterfly network is closely related to the hypercube network. If we coalesce 
all the nodes in the same column, then the network reduces to the hypercube 
network. 

Shuffle- Exchange: The exchange interconnection consists of links between nodes i 
and i +1 if t is even. The shuffle interconnection provides a link from processor 
i to 2 » ( mod N -1). As a special case, if i =N -1 we connect N -1 to itself. 
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3.2. Concurrent Processor Systems 

In ths section we give a brief description of some of the concurrent processor 
systems which either have been implemented or are currently being designed or 
implemented. 

Coamie Cube 

The "Cosmic Cube" is an experimental hypercube concurrent processor system 
built at The California Institute of Technology and consists of 64 nodes [Se85]. 
Each node contains Intel 8086 and 8087 co-processors and 136Kbytes of main 
memory. 

Intel iPSC 

Intel is marketing hypercube concurrent processor systems in 32, 64, or 128 node 
configurations. Each node contains Intel 80286/80287 co-processors and 
512Kbytes of main memory. The communication links between nodes are 
bidirectional and have a transfer rate of 10 Mbits per second. We will have 
more to say about this system since it is available at NASA Ames and has been 
used in conjuntion with this report. 
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JPL Mark III 

The hypercube research project at JPL is directed toward the design and imple- 
mentation of a high-performance hypercube concurrent processor system[JPL85]. 
Each node contains Motorola 68020/68881 co-processors, a Motorola 68020 I/O 
processor and up to 4 Mbytes of main memory. The communication links are 
capable of transferring data at 100 Mbits per second. A 32-node prototype is 
scheduled for completion in February ,1986 and a 256-node version will be avail- 
able in January 1987. Beyond this, there are plans to include a Weitek scalar 
floating point unit (.25-2.0 Mflops) in each node and later to add a Weitek vec- 
tor floating point unit (5-10Mflops). The long-range objective at JPL is to con- 
struct a 1024-node system which, when fully configured, would have a rated per- 
formance of 5-10 Gflops and a memory capacity of 4 Gbytes. 

NCUBE 

The NCUBE is a new machine with a proprietary CPU and small local memory. 
The current version can be configured with up to 1024 nodes each with a max- 
imum of 128K bytes of main memory. The processors are interconnected in a 


hypercube configuration. 


w 
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Lawrence Livermore Laboratory 

The Parallel Processing Project at LLL is concentrating on shared memory confi- 
gurations with vector processing nodes. They are studying the performance of 
various switch designs for processors which generate memory references typical 
of today’s vector machines [BR185, BR 285]. 

The BBN Butterfly 

The BBN Butterfly is a shared memory concurrent processor system with a But- 
terfly interconnection network. Each node consists of a M68020 processor, a 
M68881 floating point co-processor and up to 4 Mbytes of main memory. There 
is a separate processor in each node called the Processor Node Controller(PNC). 
The PNC initiates all messages transmitted over the butterfly switch and is 
involved in every memory reference made by the M68020, The PNC uses a 
memory management unit to translate the virtual addresses used by the M68020 
into physical memory addresses. Physical addresses may correspond to locations 
in some other node and it is the responsibility of the PNC to initiate the data 
transfers through the switch. This translation is transparent to the M68020, 
and thus the PNC provides a shared memory view. The butterfly switch has a 
processor-to-processor bandwidth of 32 Mbits per second. A 128-node system 
has been built. 
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Los Alamos 

Researchers at, Los Alamos National Laboratory are designing a 1024 node con- 
current processor system. The nodes will be capable of performing from 10 to 20 
million non-pipelined floating point operations per second and are hypercube- 
connected. The nodes consist of two AMD 29325s, a NS32032, and at least 
16Mbytes of main memory organized into 16 banks. Each node will also contain 
a disk with about 1/2 billion bytes of storage. 

Princeton Navier-Stokes Machine 

Daniel M. Nosenchuck and Michael G. Littman at Princeton University are 
developing a concurrent processor system(called NSC) to numerically simulate 
the full Navier-Stokes equations with no modeling. Each node has 8 Mwords of 
32-bit interleaved memory and is capable of an average sustained computaion 
speed of 100 million floating point operations per second. The nodes are mesh 
connected and a 128 node prototype is under development. 

3.3. The Intel iPSC 

| 

In this section we give a more detailed description of the architecture of the 
Intel iPSC concurrent processor system. 

Cube : 


i 
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The Intel iPSC concurrent processor system consists of either 32, 64, or 128 
nodes. Each node has an Intel 80286 central processing unit and an Intel 80287 
numeric processing unit supporting 32-, 64-, and 80-bit floating point 
operations (IEEE 754). Each node has 512 Kbytes of main memory(can be 
upgraded to 4.5 Mbytes) per node. The communications between nodes is over a 
10 Mbit per second point-to-point serial channel(Intel’s 82586 communication 
processor). Each node has 8 communication channels: seven for communicating 
with neighboring nodes and one for communicating with the cube manager. 

Cube Manager : 

The cube manager is an Intel 286/310 system which consists of an 80286 central 
processing unit, an 80287 numeric processing unit, and up to 4 Mbytes of main 
memory. It comes with either a 40 Mbyte winchester disk or a 320 Mbyte disk. 
There is a global ethernet channel for communicating with the cube nodes. An 
ethernet TCP/IP network subsytem is available which is used to provide access 
to the cube manager from other hosts on a LAN. 

Software'. 

The cube manager comes with the XENIX 3.0 operating system, FOR- 
TRAN 77 compiler and a C compiler. 

Intel provides a multiprocessing operating system which is resident in each 
of the nodes and provides the following services to each of the node processes: 
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System Call 

Meaning 

eopen 

Creates a channel for node process communication. 

cclose 

Destroys the communication channel created by copen. 

send 

Initiates transmission of a message to another process. 

aendw 

Initates transmission of a message to another process and 
returns only when the channel is available for reuse. 

reev 

Initiates the receipt of a message. 

reevw 

Initiates the receipt of a message and 
blocks until the message is received. 

statue 

Informs the calling process of the availability of a channel. 

probe 

Determine if a message of the specified type has been 
received on a given channel. 

mynode 

Returns the node number of the calling process. 

mypid 

Returns the process id of the calling process. 

eubedim 

Returns the dimension of the cube. 

clock 

Returns the elapsed time(in milliseconds) since the 
node was initialized. 

flick 

Relinquishes the CPU. 


Processes are downloaded into the nodes from the cube manager. We can 
load more than one process into each node and provide user-assigned process id’s 
, for each. Additionally, as part of loading processes into nodes, the user can 
optionally specify the maximum number of open channels and the maximum 
stack size per process. 

Software for. the cube is developed on the cube manager and then down- 
loaded to the nodes. Typically, there is a cube manager process which communi- 
cates with the node processes during course of a computation. At the very least 
the cube manager process starts the computation by transmitting data to the 
nodes and collects data from the node processes a the end of a computation. 
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Communication Architecture: 

The following is a list of observations concerning the communication archi- 
tecture of the Intel system. 

Little opportunity to overlap communication with computation. Each node 
has a single CPU which is required to set up all communication between nodes. 
This effectively eliminates much of the opportunity to overlap communication 
with computation. The extent to which the overlapping occurs is provided by 
the separate 82586 communication coprocessors. We would prefer a separate 
I/O processor which is capable of independently executing its own instruction 
stream. 

Too much copying required. Communication often requires an inordinate 
amount of data moves. If the data to be transferred are not contiguous, then 
they have to be copied into contiguous locations before transfer. All data 
received are first placed in system buffers and then copied to contiguous loca- 
tions in the user’s space and, finally, if the data belong in non-contiguous loca- 
tions the user has to copy the data once more. There should be some way in 
which constant stride data can be moved from one place to another without the 
intervention of the node processor. It would be useful to have a simple DMA 
device which could do memory-to-memory transfers. 

Processes in the same node do not share address space. Processes within the 
same node should be able to share address space with each other. For example, 
if we were to have a separate communications processor, then the I/O process 
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would most naturally be required to transfer problem data to and from the node. 
The most efficient way to accomplish this would be to allow the I/O process to 
share the address space with other processes within the same node. If it does not 
share address space then the only way for the communication process to pass 
data to other processes within the node is by message passing. This would 
defeat the advantage of using a separate I/O processor. Another reason to share 
address space within a node is to permit node processes to share code. 

Performance : 

The paper by Kolawa and Otto[KO] give a number of interesting performance 
results for the Intel iPSC. This paper determines the speed of the basic opera- 
tions used in the Intel iPSC hypercube. Because of its pertinence to our report, 
we include the Kolawa and Otto paper as an appendix. 


TR-86,7 


■ n- 


March 18, i960 


4. The Equations 

The unsteady, three-dimensional Navier-Stokes equations in Cartesian coor- 
dinates (z ,j/ ,z ,t ) are taken as the basic set of equations[Lo82). The Cartesian 
space represents both the physical domain and the the computational domain. It 
is known that the physical domain can be transformed into other curvilinear 
coordinates thus making it possible to treat a wide variety of geometries using 
one basic set of equations over a simple computational domain, These transfor- 
mations introduce additional metric terms into the basic equations. We have 
chosen not to include these terms in order to simplify our presentation. Obvi- 
ously, the elimination of the metric terms reduces the computational burden 
and so later in the report we will determine the effect of the metric terms on our 
performance estimates. 

The three-dimensional Navier-Stokes equations are given by: 

f + h E - E ^ + & F + ~ G °> = ° m 

where 


Q = [p pu pv pw e] 1 , 

E = \pu puu +p pvu pwu (e+p)u] < , 

1 = “T” [0 T Z y Tyy T ;y fly ] , 


^ Tzl ^ yz Tzl ^ ’ 


F = [pt> puv pvv +p pwv (e+p)v 


Re 


G = [pw puw pvw pww +p (e +p )«/] ( , <?„ = ~[0 t xz T yz r zz 0 Z ] 1 


3u , dv , dw 


du 


d u . dv , 
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The velocity components u .v . and u> are made dimensionless by a gg, the 
freestream speed of sound, the density p is made dimensionless by p x and the 
total energy e by p^ a £ . The pressure p is given by 

( 7 - 1 ) (e -0.5p(u 2 -+- v 2 4 -to 2 ) ) where 7 is the ratio of specific heats. Also, k is the 
coefficient of thermal conductivity, // is the dynamic viscosity, and A from 
Stokes’ hypothesis is -2/3//, The Reynolds number is Re and the Prantl 


number is Pr . 
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5, Numerical Simulation Methods 

In this section we describe two methods for numerically simulating the 
Navier-Stokes equations. The first method is an explicit procedure of the Lax- 
Wendroff type|Am69] and the second is the implicit factored method developed 
by Beam and Warming[BW78]. Before presenting these methods we provide 
some useful notation and definitions. 

The computational domain D is the set of spatial points over which we 
attempt to numerically simulate the evolution of the dependent variables p , pu , 
pv , pw , and c . The points in D are called grid points and they form a uni- 
formly spaced "grid" over D . Let Az Ay , and Az denote the grid spacings in 
each to the coordinate directions, respectively. Let there be / grid points in the 
x -direction, J grid points in the y -direction, and K grid points in the z- 
direction. 

Formally, we define D as 

D - { (x ,y ,z ) \ x =i Ax , y = j Ay , z -k Az yshere i , j , and k are positive integers 

and i ^ / , j < J , and k ^ K } 

The grid points in D are represented by triples of indices ? , j , and k . 

That is, (» ,; ,A ) denotes the grid point (z ,y ,z ) with x =i Az , y =j Ay , and 
z =k Az . 

The dependent variables are simulated over a discrete set of points uni- 
formly spaced in time. The difference between successive simulation times is 
called the time step and is r’enoted by At . 
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The notation />,"* refers to the quantity p at spatial location (x ,p ,x ) and 
time t where x «i Ax , y*;Ay,z Ax , and t *n At , Extending this nota- 
tion we have, 

Q, n jk “ [Pijk P< 3 k P< } k P<,k , 

and 

= e(q," 4 ) , 

and so on. 


5.1. Lax- Wendroff Explicit Method 

The Lax-Wendroff method is an explicit procedure for computing 
from Qj’jif for all grid points (i ,/ ,k ). Part of this procedure involves computing 
intermediate values which are associated with the spatial indices 
(t +1/2, j +1/2, k +1/2) and the time parameter n +1/2. 

We introduce some notation for describing differencing and averaging 
processes. Let 

xfaee (»' J ,k) = {(t ,/,*), (» J +1,*), (» J £ +1), (i ,j +1 ,k +l) } , 

yfaee (t ,j ,fc) = {(i ,j ,k), (» +1 ,j ,k), (» ,/ ,fc +l). (i +1,/ ,k +1) } , 

zfaec (i ,j ,k) - {(> ,j ,fc), (t +1,; ,fc), (» ,j +l,fc), (f +1 J +1 ,k) } , 

and 


cube (i ,k ) = xfaee (» ,j ,k ) (J xfaee (i +1 ,/ ,k ) . 
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If in some quantity, say pfy , we replace the triple i , j ,k by a set of points, 
say zfaee (i ,j ,k), then the notation denotes the sum of the values of p n over all 
points in xface (i ,k). For example, if r denotes the triple (i ,j ,k) then 

Pifatt{r ) = S P> • 

9 e if act (r ) 

Finally, we define, by example, the numerical differencing operators 6 X , 6 y , 


and Sg . 


S zP,% = 


Filk 


Pzfact (i +!,;,<? ) Pxfact (i ,j ,k ) 
4Az 


Fyfact (i ,) +l,k) ~ Fyfact (« ,* ) 

4A y 


t t . \ n ( e / )zfatt +l) i e l) zfact (i ,j ,k) 

6 ‘ <*' >■* 7T, 


It should be clear from the above examples how the numerical difference 
operator works. 

Let {i ,j ,k) denote a triple of indices (not necessarily integer valued). 
Then we define (i‘ ,j ,k) + to be (i +1/2,.; +1/2, k +1/2) and (i ,j ,k)~ to be 
(i -1/2, ; -1/2, A: -1/2). 

Finally, we define the sets C and I as 

C = {(» ,j,k) + | (« J ,k)eD } p| {(* ,k)~ | (i ,/ ,k)eD } , 

I = {r | r cD and r + e C and r ~eC } , 

and 
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B - D - I . 


The grid points in B are called boundary grid points, those in I are called 
interior grid points, and those in C are called central grid points. 
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Lax- Wcndroff Method 

Input: Q r n is given for all r tD and some nonnegative integer n . 

Output: Q r n +1 for all r cD 

Method: 


Step 1: For all r eC compute 

n » + 1/2 _ ^ cukt ( f ~) 
Vr 8 



+ 6 y f;. + 6 U ) . 


Step 2: For all r eC compute 

E r n+l ! 2 = E {Q t n +1 / 2 ) , 


F r "+i/2 = /r(Q r » + 1/2) ) and 
G r " +1 /2 = G[Q r n +1 / ? ) . 


5tep S: For all r eC compute ( E v ) r n , ( F v ) r n , and ( G v ) r ". Some examples of 
these computations are: 


( r «* )• = X ( S z u r" + ^«V") + 2//<S I U f n - , 


and 


77*. («/ + } K".h (,-) ('» )r” + (,-) (>i„ ),” + <*. (,-) ),”) 


5<ep For all r «I compute 

<?," *' - e,“ - At (i, (B r "- + ‘/ 2 - (E, + S s (Fy'i* - (F „ + s, (cy'i* - (g , ),“)) . 


Step 5: 

For all r eB compute Q r " +1 from Q r n according to the appropriate boundary 
conditions. 
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5.2. Beam and Warming Implicit Factored Method (Berger’s 
Equation) 

Implicit methods have been proposed for the numerical solution of various 
forms of the Navier-Stokes equations. Implicit methods are more complex than 
explicit methods since the former usually require the solution of a large number 
of systems of equations. However, implicit methods have improved stability pro- 
perties over explicit methods thereby permitting a larger At . They have the 
drawback that they require significantly more computation than explicit 
methods. 

The numerical method we shall consider is based on the work of Beam and 
Warming. The formulation of the method by Beam and Warming actually 
includes a number of different methods depending on the choice of certain 
parameters. We will not give the complete details of the method since they can 
be found in [BW78] and [Pu84]. 

As before, our objective is to determine Q ” +1 given Q k where k ^ n 
(Notice that we admit the possibility that Q n +1 may depend on more than just 
the immediately preceding time-step). The temporal scheme for advancing time 
is given by 


i 


fcrSt: 
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A «” = TT7l A<? ” + TT?£«" + 7Tf A<r '‘ 

+ O[{0 - 1/2 - £)At 2 + At 3 ] , 


where Q n = Q (n At) and A<?" = Q n +1 - Q n . The choice of 0 and £ 
reproduces many two and three-level, explicit and implicit schemes, 

The Navier-Stokes equations are solved for dQ /d 1 and then substituted 
into the temporal scheme given above. This results in a nonlinear set of equa- 
tions for AQ n . A linear set of equations is obtained by the use of Taylor series 
expansions of various terms. For example, E n+l is replaced by 

E n+i = E n + (|^) n (Q n+1 - Q n ) + O (At 2 ) 
dQ 

We have implemented the Beam and Warming implicit factored method for 
an equation called Berger’s equation rather than the full Navier-Stokes equa- 
tions, It was felt that the time spent in developing the code for the full Navier- 
Stokes equations would be excessive and that the basic issue concerning the per- 
formance of an implicit method could be resolved with the simpler equation. 
Furthermore, the'.implicit method for the Navier-Stokes equations leads to a 
large number of block tridiagonal equations whose simultaneous solution requires 
a large amount of intermediate storage which is not available on the Intel hyper- 
cube. Burger’s equation has only a single dependent variable and the implicit 
method gives rise to a large number of (scalar) tridiagonal equations. Both of 
these factors mitigate the storage requirements and permit the use of reasonably 


m 

I- 


I 
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sized domains. 

From a performance evaluation point-of-view, it appears that nothing is lost 
by using Berger’s equation. The reason is that the computational requirements 
for the Navier-Stokes equations are much greater than for Berger’s equation but 
the communication requirements for Navier-Stokes are greater to a much lesser 
extent. This implies that efficient implementations for Berger’s equation should 
be more difficult to find than for the Navier-Stokes equations. 

Berger’s equation is given by 


at ax' 


*> + £(/-*.) + £(« -o.)-o 


where Q = u , and 


E = u 2 , 
F = u 2 , 
G = u 2 , 


El 

E„ = U-— 


F„ =u 


G„ = u 


dx ’ 

du 
dy ’ 

du 

dz 


Setting £ = 0 and 6 = 1/2 in the temporal scheme yields the 
implementation: 


scheme used in our 
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After linearization and factorization the temporal scheme can be written as 

L x L y L z AQ n ** H (5.1) 

where L x , L v , and L x are operators given by 


£. ={1+A<I(£)“” 

L„ «{l+AI[(±)i,- 


2 dx 2 


"JL 

2 by 2 


]} 

']} 




? 


L, ={1+AJ|(^)«" 


±JL 

2 a z 2 


]}, 


and 


9u ' 


3 2 u 




) +( 


a 2 u 


du 2 
dy V dy 


du 


9 z u 


)+(—* tt)I 


dz 


a z‘ 


Equation (5.1) holds pointwise in the spatial coordinates and relates the 
dependent variable at the various time steps. The important point about the 
operators, L x , L y , and L x is that they each involve spatial derivatives in a sin- 
gle coordinate direction. 

Letting X = L y L z AQ " and Y - L x AQ n we can rewrite equation (5.1) 
as a sequence of equations which corresponds to the actual implementation 
sequence. 


L x X = H , 


(5.2a) 


^HBKi 




Li 
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(5.2b) 


and 


L z AQ n = Y . 


(5.2c) 


The idea is to first solve equation (5.2a) for X and then we use X in (5.2b) 
and solve for Y . Finally, we use Y in (5.2c) and solve for AQ ” . 

We obtain the basis for a numerical algorithm by approximating the spatial 
derivatives with finite-difference quotients. We assume a computational domain 
D as defined at the beginning of section 5. When we substitute finite difference 
quotients (three-point central-difference) for the spatial derivatives in equation 
(5.2a) we get a system of difference equations of the form 


Cx , _i-Y, -i + Ax( X{ + Bij +l X( +i - if, , 


(5.3a) 


where 


a i , A t u 
A x i - 1 + 2 1 

Ax 

n A t / V . 

Bz ' - - 2aT ( “' + ’ 


and 


n At i 
Cl: = (U: 

' 2Ax v ' 


Ax 


). 


for 1 < i </ . 


In equation (5.3a) we have suppressed the j and k indices. The dependent 
variable X and the coefficients are defined for each grid point and therefore we • 
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should write them as , Az,' ; * , etc. However, we drop the j and k indices 
since we are assuming that the suppressed indices are identical throughout 
(5,3a). This will be the usual assumption for suppressed indices. Thus, according 
to (5.3a) we get one system of equations for each pair j , k corresponding to 
interior grid points. 

We obtain similar results by approximating the spatial derivatives in (5.2b) 
and (5.2c) with finite-difference approximations, namely, 

CVj -i Yj + Ay; Y, + By } +1 Y, +1 = X, , (5.3b) 

for each t , k corresponding to an interior grid point, and 

Cz„ ..Aft"., + Az„ A ft" + Bz t +1 A ft" +1 = Y t , (5.3c) 

for each t , j corresponding to an interior grid point. 

Boundary conditions enter the picture when the terms in equation (5.3) 
depend on values associated with boundary points. Just as in the case of the 
explicit method, the boundary conditions are problem specific and therefore it is 
difficult to say anything general about them. Usually equations (5.3) result in a 
set of tridiagonal equations. However, if the boundary conditions are periodic in 
the x-direction then equations (5.3a) result in a periodic tridiagonal system of 
equations for which solution algorithms are available [Te75]. 


TR-86.7 


- 35 - 


March 18, 1988 


We show by example how the boundary conditions can affect the form of equa- 


tions (5.3). Let 7*7. Then equation (5,3a) for a fixed j and k , in matrix no- 
tation, is 


Cx t Az 2 Bz 3 0 0 0 0 

0 Cz 2 Az 3 Bz 4 0 0 0 

0 0 Cx 3 Ax 4 Bx 5 0 0 

0 0 0 Cz 4 Ax 5 Bz 6 0 

0 0 0 0 Cz 5 Ax 6 Bx 7 


*1 

x 2 

*3 

*4 

*5 

*6 

*7 


h 2 

*3 

Hi 

H, 

*6 


(5.4) 


We have 5 equations and 7 unknowns and thus we need additional condi- 
tions to completely specify the system of equations. These additional conditions 
are obtained from the boundary conditions of the particular problem at hand. 
For example, if the boundary conditions are periodic in the x -direction, then 
Q\ - Q 6 an ^ ?2 ~ Q? ■ This implies that X j - X 6 and X 2 - X 7 . Substi- 
tuting into (5,4) we get the following periodic tridiagonal system of equations. 


Ax 2 Bz 3 0 0 Cx x 


' *2 


H 2 

Cx 2 Ax 3 Bx 4 0 0 


*3 


Hz 

0 Cz 3 Ax 4 Bx s 0 


*4 

= 

Hi 

0 0 Cx 4 Ax 5 £x 6 


*5 


Hs 

Bx 7 0 0 Cx 5 Az 6 


. * 6 . 


. H 6 . 



4 
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Next suppose that Q" is fixed at some freestream value and Q 7 * . It fol- 

lows that X i * 0 and X 7 = X 6 , Substituting into (5,4) we get the usual tridi- 
agonal system of equations. 


Ax 2 

Bx 3 

0 

0 

0 


' X 2 


H 2 

Cx 2 

Ax 3 

Bx 4 

0 

0 


*3 


#3 

0 

Cx 3 

Ax 4 

Bx 5 

0 





0 

0 

Cx 4 

Ax 5 

Bx 6 


X , 


#5 

0 

0 

0 

Cx 5 

(Ax e + Bx 7 ) 


X 6_ 


. *6. 


The above examples were intended to show how boundary conditions affect 
equations (5,3). Boundary conditions also come into play in computing, H , the 
right-hand-side of equation (5.3a). The approximation of the spatial derivatives 
in H in (5.1) with finite-difference quotients is affected by the boundary condi- 
tions since some of the terms in H contain spatial derivatives in each of the 
coordinate directions. Therefore, when the indices of are adjacent to a 
boundary grid point, then the boundary conditions are taken into account. 


We are now in a position to state the numerical algorithm. 


3MK* 
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)3eam and Warming Implicit Factored Method 


Input: Q/jit for all ( * ,k )<D . 

Output: Q, n jk +i for all (i ,j ,k )tD . 

Method: 

Step 1: (Compute X |; * for all interior grid points) For each X < j < J and 
1 < k < K solve equation (5.3a) for 1 < « < I , 

Step 2: (Compute Y,^ for all interior grid points) For each 1 < i < I and 
1 < k < K solve equation (5.3b) for 1 < ; < J . 

Step S: (Compute A Q,"* for all interior grid points) For each 1 < i < I and 
1 < j < J solve equation (5.3c) for 1 < k < K . 

Step 4 : (Update the dependent variables) Set 


for all (i ,j ,Ar )cl . 

Step 5: Update all values of for (i ,/ )cB . 


4 
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0. Requirements for CFD 

6.1. Storage Requirements for CFD 

We characterize the storage requirements for CFD in terms of the size of 
the computational domain D and the amount of storage required per grid point. 
The amount of storage per grid point depends on a number of factors, The 
dependent variable Q occupies 5 floating point words per grid point for each 
time step and even with an explicit method it is sometimes convenient to store 
the value of Q n while computing Q n +1 . If we have transformed the original 
problem from a physical domain into the computational domain, then there are 
additional metric terms which are associated with each grid point, Also, when 
using an implicit numerical scheme, intermediate results are usually generated by 
algorithms for solving block tridiagonal systems. For example, forward substitu- 
tion increases the storage requirements by 30 additional floating point values per 
node. Whether we must store all these intermediate values simultaneously 
depends on the particular implementation strategy. 

We define gp_per_node to be the number of grid points per node, 

bytes _per node to be the amount of main memory per node devoted to storing 

the data, and val _J>cr__gp to be the number of floating point values associated 
with each grid point. We assume that it take 4 bytes to store one floating point 
value. It is obvious that, 


gp per node 


bytes per node 
4 val per_node 


H3SS 
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6.2. Computational Requirements for CFD 

In the following we develop some straight-forward relationships which give 
some insight into the factors which determine the computational requirements of 
CFD, As we shall see, an important measure of the capability of a concurrent 
processor system is given by the product of the number of nodes times the sus- 
tained floating point computation rate of each node, In the table below we 
define some of the important terms, 


NAME I 

MEANING 

N 

1 Number of nodes. 

!DI 

1 Number of grid points in the domain. 

gp ~j)cr~node 

Number of grid points per node(defined above). 

9P~JP er 1 *Jt e 

Rate at which grid points are updated. 

flop_per__gp j 

Number of floating point operations to update one grid point, 

flop per__««c i 

i Sustained rate at which node can perform floating point operations. 


| The number of seconds to update all grid points in the domain. 


i i.e,, The number of seconds to advance the solution by one time step. 


There are some obvious relationships among the above quantities, namely, 

SP—JP er —. n °d e = |D|/ N (l) 

flop _per_sec = flop _Jptr_gp gp_per_see (2) 

sec_per~j8 = gp_per_node / gp _pcr_sec (3) 

We use “node complexity” to refer to the computational capability of each node. 
For example, a node with a bit-serial CPU would have a very low node complex- 
ity while a node which consists of a CRAY CPU would have a high node com- 
plexity. The sustained floating point operation rate of a node is a reasonable 
measure of its complexity, and the product of the floating point operation rate of 
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each node times the number of nodes is a useful figure of merit for a concurrent 
processor system. 

Using the above equations we obtain the product of flop _per_see and N . 
It is interesting that this product is determined by the domain size, the number 
of floating point operations per grid point (per time step), and the number of 
seconds allowed per time step. The quantities on the right-hand-side are meas- 
ures of the computational demands of the problem and the left-hand-side is a 
measure of the computational capability of the concurrent processor system. 


N flop per see 


D| flop per gp 
see per is 


For example, if we have that the number of floating point operations per grid 
point is 2K, and if we require that the number of seconds per time step not 
exceed one, and that the domain contains 256 3 grid points, then we get: 


N flop _per_sec = 2 35 . 

Thus for a system with 1024 nodes, each node must be capable of a sustained 
processing rate of 32 million floating point operations per second. The rated 
performance of the proposed Los Alamos machine comes close to meeting this 
requirement. If we have more nodes (no machine has been proposed with more 
than 1024 nodes and significant floating point capabilities.) then a diminished 
floating point capability would suffice. On the other hand, if we want to do the 
job with an 8-node system, then each node would have to achieve a sustained 
rate of 4 billion floating point operations per second. 


TR-86,7 


* 41 - 


March 18, 1986 


The Princeton machine will have 128 nodes and therefore each node would 
have to achieve a sustained rate of 256 million floating point operations per 
second. The predicted performance of each node is 100 million floating point 
operations per second. 


7. Lax- WendrofT Method 
7.1. Implementation 

In this section we describe an implementation of the Lax-Wendroff method. 
The program was written in C and was run on a 32-node Intel hypercube. 

The obvious way to implement the Lax-Wendroff method is to partition the 
computational domain D into subsets, which we call cells, and to assign each of 
these cells to a different node in the hypercube. Recall that the domain D has I 
grid points in the x -direction, J grid points in the y -direction, and K grid 
points in the z -direction. Each cell is a "box" of grid points with II grid points 
in the ^-direction, JJ grid points in the y -direction, and KK grid points in the 
z -direction. The indices i , j , and k are used to refer to grid points within a 
cell and these indices range from 1 to II , JJ , and KK , respectively. 

Throughout we assume tha t / , J , K , and II , JJ , KK are powers of 2. 
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1 


In the Figure we show a computational domain partitioned into 16 cells. 
Each cell is identified by the coordinates a , b , and c . In general, we have AA 
cells along the x -direction, BB cells along the y -direction, and CC cells along 
the z -direction. The cell coordinates range from 0 to AA -1, BB -1, and 
CC -1, respectively. It is easy to see that 


AA 


I_ 

II 



and 


CC 


K 

KK ‘ 


We define ABIT , BBIT and CBIT as follows: 
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A A = 2 abit ; BB = 2 BBIT ; CC = 2 CBIT 

In this implementation we assign each cell to a node according to the function 
cell_to__node which maps the cell coordinates into a node number. In prepara- 
tion for defining this function we introduce the function gray (r ,e ) which is 
defined for all integers r and a where 0 < r < 2* . The value of gray (r ,a ) is a 
Gray code on a bits for the integer r . This function has the important property 
that gray (r ,s ) and gray (r +1 mod 2 a ,s ) differ in exactly one bit. Finally, 

cell_to node (a ,b,c ) = gray (e ,CBIT ) * gray ( b ,BBIT ) * gray (a ,ABIT ) 

where ' denotes concatenation. The importance of the cell to node mapping is 

that "adjacent” cells map into "adjacent" nodes. Nodes are adjacent if they 
have a direct communication link between them and cells are adjacent if they 
differ by one in exactly one coordinate. Clearly, adjacent cells will have to 
transfer data between themselves and so it is advantageous that they be mapped 
to adjacent nodes. Certain cells which are not adjacent are also required to 
exchange data, but these cases do not dominate and in the worst case the data 
passes through two intermediate nodes. 

Since there are A A *BB*CC nodes (there is a one-to-one correspondence 
between cells and nodes), the dimension of the hypercube must equal 
ABIT + BBIT + CBIT . 

Each node must compute Q n +1 for all grid points in its cell. This evalua- 
tion uses the values of Q n associated with grid points in other cells. It turns 


TR-88,7 


-44- 


March 18, 1986 


out that cell (o ,6 ,e ) will require certain values of Q n from cells with indices 
(a ±1,6 ±l,c ±1). In our program the dependent variable Q is represented by 
five 3-dimensional arrays called d , du , dv , dw , and e , corresponding to p , pu , 
pv , pw , and e , respectively. The arrays are dimensioned II +2 in the x - 
direction, JJ +2 in the y -direction, and KK +2 in the z -direction. As was men- 
tioned earlier, the indices i , j , and k are used to refer to grid points within the 
cell and they range from 1 to II , 1 to JJ , and 1 to KK , respectively. These 
arrays are oversized to make room for values associated with grid points in 
"neighboring" cells. For example, if our reference cell is (a ,6 ,c ) , then 
d [0] [lj [iiCiT +l] is the density associated with the grid point // , 1, 1 located in 
cell ( a -1,6 ,c +1). The values for variables such as d [0] [l] +l] in cell 
(a ,6 ,c ) are obtained by explicit communication with the node that contains the 
cell (a -1,6 ,e +1). 

The node program is the following: 
init_data(); 

for( i = 1; i <= ITER; i-j — (- ) { 
xfer_data(); 
sweepQ; 

} 

The init,_data() routine initializes the data arrays. 

The xfer data() routine specifies all communications between the nodes. 

The idea is to transfer into each node sufficient information from "neighboring" 
nodes so that each node can advance the solution to Q n +1 for all of its grid 
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points. Each node uses nonblocking receive system calls(recv) to establish 
buffers for the incoming data, then uses blocking send system calls(sendw) to 
transmit data to neighboring nodes, and finally waits until all the anticipated 
data arrives. 

The sweep() routine performs the Lax-Wendroff computation for one time 
step. SweepQ corresponds to Steps 1 to 5 of the Lax-Wendroff method 
presented earlier. The implementation of this routine takes into account that 
the terms Q r n +1 / 2 , E r n +1 / 2 , (E„ ) r ”, etc. for r cC are common to the evaluation 
of Q " +1 for eight different grid points. The grid points (i ,j ,k) corresponding 
to a fixed value of k , (called a k -plane ) depend on terms evaluated at central 
grid points with r =(t ,j ,k± 1/2). Accordingly, we compute the values associ- 
ated with central grid points for two successive k-planes of central grid points. 
This enables us to evaluate Q n * 1 for all grid points on the k-plane in between 
the two k-planes of central grid points. After this we compute the next k-plane 
of central grid points, r =(» ,j ,k +1+1/2) using the space occupied by the values 
associated with r =(t ,j ,k -1/2) . In this manner we "sweep" through the 
domain and at any time we need only store the values associated with central 
grid points associated with two k-planes. 
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7.2. Efficiency 

Wc define the efficiency, , of an algorithm, 4 , as follows: 

where N is the number of nodes, t P p< is the time to solve the problem on one 
node using the best possible algorithm, and tj$ is the time to solve the problem 
on N nodes using algorithm A . Clearly, t ° pt /N is the fastest time we could 
achieve using N nodes and so e' 4 is not greater than 1. 

We can estimate the theoretical efficiency of our implementation of the 
Lax-Wendroff* method. Each cell contains v (cell ) grid points and 9 (cell ) grid 
points on the boundary of the cell. We estimate the values of t ° pt and t$ for 
one time step as follows: 

t ° pi = N v ( cell )t calc d 

and 

t$ = v ( cell )t calc d + 9 {cell )t eomm e , 

where t ca i c is the time for a floating point operation, d is the number of floating 
point operations required per grid point, t eomm is the time required to transfer 
one floating point number between adjacent nodes, and c is the number of float- 
ing point numbers that are transferred per boundary grid point. We are glossing 
over a few details which do not change the qualitative nature of this efficiency 
estimate. For example, not all floating point operations require the same 
amount time, and the number of floating point numbers that are transferred 
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between cells depends on the cells , and not all transfers are between adjacent 
nodes. 

Using the above equations and after some simplifications we get 

c > 1 _ ^comm C 

€a ' v (cell ) t eale d 


Next we estimate d(cell ) and v ( cell ) in terms of // , JJ , and KK . The result, 
after simplification, is 


e A 




+ 


JJ KK' t 


)■ 


t 


comm 


calc 


It is easy to see that the efficiency grows as 1 - O flDl" 1 / 3 ). Also, it is expected 
that d »c and so this tends to improve the efficiency. In the case of the Intel 
cube the ratio t comm / t cai( is 149 [KO], but this is offset by the other terms. 


7.3. Performance 

The code has been instrumented to count the total number of floating point 
operations performed and to determine the amount of time devoted to the com- 
putation and the communication. We worked with a cell size of II = 8, 

JJ = 10 and KK = 10. Therefore each cell contained 800 grid points and since 
there are 32 nodes, D contains 25,600 grid points. The total time for a run in 
which ITER was 10 took 309 seconds and each node performed 4,594,680 float- 
ing point operations. This amounts to a total of 475,936 floating point opera- 
tions per second. 
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Out of the 300 seconds, 288 seconds were devoted to computation and 21 
seconds were devoted to communication. A crude estimate for the efficiency is; 


€ lax 


288 

21 + 288 


0.932 . 


The same code, specialized to a single node, was run on a VAXll/780 and we 
found that the VAX maintained a sustained rate of approximately 32,000 float- 
ing point operations per second. Thus the 32-node hypercube is almost 15 times 
faster than the VAX. 

The Lax-Wendroff code has not been extensively tested and there is a prob- 
lem with the way in which the C deals with NaNs generated by the 80287 copro- 
cessor. The NaN problem is expected to be cleared up shortly. The next phase 
will be to test this code extensively for a variety of boundary conditions and 
domain sizes. 


8. The Beam and Warming Implicit Factored Method 
8.1. Implementation 

We have implemented the Beam and Warming implicit factored method for 
an equation called Berger’s equation rather than the full Navier-Stokes equation. 
It was felt that the time spent in developing the code for the full Navier-Stokes 
equations would be excessive and that the basic issue concerning the performance 
of an implicit method could be resolved with the simpler equation. Furthermore, 
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the implicit method for the Navier-Stokes equations leads to a large number of 
block tridiagonal equations whose simultaneous solution requires a large amount 
of intermediate storage which is not available on the Intel hypercube. Burger’s 
equation has only a single dependent variable and the implicit method gives rise 
to a large number of (scalar) tridiagonal equations. Both of these factors miti- 
gate the storage requirements and permit the use of reasonably sized domains. 

From a performance-evaluation point-of-view, it appears that nothing is lost 
by using Berger’s equation. The reason is that the computational requirements 
for the Navier-Stokes equations are much greater than Berger’s equation but the 
communication requirements for Navier-Stokes are greater to a much lesser 
extent. This implies that efficient implementations for Berger’s equation should 
be more difficult to find than for the Navier-Stokes equations. 

The following is Berger's equation: 


3m 3u 2 3u 2 3u 2 

dt bz 3 y 3 r 



d l u 

+ 7T + 

3y 


3 2 u , 
02 2 


= 0 


The most important aspect of the implementation is the mapping of the domain 
D to the nodes. Consider Step 1 of the Beam and Warming method where we 
solve the equation 

Cx{ _i Jf,' _i +• Ax{ Xj + Bx( +\Xj = H( i 


for all j and k . Remember that the j and k subscripts are suppressed and Cij 
and Bxi depend on the value of u associated with grid point ijk . 
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We use Gaussian elimination to solve the tridiagonal systems, The algo* 
rithm consists of a forward sweep in which we eliminate variables and a backward 
sweep where we back-substitute to find the solution. The efficiency of our imple- 
mentation of Gaussian elimination depends on the mapping of grid points to 
nodes. A desirable mapping would map all grid points with the same j , k coor- 
dinates into the the same node. If this were so, then the above equation, for a 
particular choice of j and k , could be solved completely within the node 
without communication with neighboring nodes, except for the case of H, . 

Next consider Step 2 of the Beam and Warming method where we solve the 
following equation. 

^Vj- \ ^ j -\ A]/] Yj -f Byj + iYj + \ " Xj , 

for all i and k . In this case the t and k subscripts are suppressed and the Xj s 
are determined by Step 1. A desirable mapping for Step 2 would map all grid 
points with the same t and k coordinates to the same node. If this were the 
case then the above equation, for a particular choice of f and k , could be solved 
completely within a node. 

Finally, if ' "e consider Step 3 of the Beam and Warming method we find yet 
another preferred mapping, one that maps grid points with the same i and j 
coordinates into the same node. 

In the Figure we show the preferred mappings of the grid points into the 
nodes. X_solve corresponds to Step 1 and shows how the preferred mapping 
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would map grid points lying along a line in the x-coordinate direction into the 
same node. If the whole domain were partitioned and mapped in this manner, 
then all the nodes could compute in parallel, each one solving the set of tridiago- 
nal systems corresponding to the x-coordinate lines it contains. The Figure also 
shows the preferred mappings for Steps 2 and 3, indicated as y_solve and 
z solve , respectively. 



The preferred mappings for each of the three Steps are not compatible and 
because of the communication costs it seems undesirable to change the mapping 
in between Steps.’ We are motivated to look for mappings which can be main- 
tained throughout the computation and which are efficient at each Step. 

As before, we partition the computational domain D into cells and assign 
each cell to a node. However, we do not insist that each cell be mapped into a 
distinct node. Each cell is identified by its coordinates a , b , and c where, 
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0 ^ a < AA - , 0 < 6 < BB » ~jr , 0 ^ e < CC « , 

II JJ A A 

and 

AA *2 ABir > BB ~2 bb,t ’ CC ~2 cb,t . 

Before discusing the effect of various mappings, we describe the our imple- 
mentation in terms of cells and eell processes. We assign a process , called a cell 
process , to each cell. This means that if a mapping assigns r cells to a particu- 
lar node, then the node will contain r processes, one corresponding to each cell. 
Each cell process determines its own cell coordinates a , 6 , c , from its node iden- 
tifier, its process identifier, and the ctll_to_node mapping. It turns out that 
our implementation is relatively easy to specify in term of the cell processes. 

Each cell process is given by: 

init_data(); 

for(i = 1; i <= ITER; i++){ 
xfer_data(); 
x__solve(); 
y_s°lv e (); 
z_solve(); 

} 

The init dataQ routine initializes the value of u in each cell. 

The xfer_data() routine transfers the values of u between cells. Cell 
a ,6 ,e needs values of u from cells a ±1,6 ,c , a ,6 ±l,e , and a ,6 ,c ±1. The 
transmitted values are those which are required to evaluate H in equation (5.1). 
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The x_jK>lve() routine consists of forward and backward sweeps along the 
x-direction for each value of j and k where 1 </ ^JJ and l^k ^KK , In gen- 
eral, each cell process does only a portion of the forward and backward sweeps 
along the ^ direction for each j and k . The forward and backward sweeps 
along a "line” in the x__direction will span AA different cells. Cell processes 
a ,6 ,e with o *1 can begin their forward sweeps immediately. When all the 
sweeps reach i =// , the intermediate results are sent to cell processes with coor- 
dinates 2,6 ,c . In general, a cell process a ,6 ,c with a #1, must wait until it 
receives intermediate results from cell process a -1,6 ,c before proceeding with 
its forward sweeps. If a ^ A A —1, then when all the forward sweeps reach i = II , 
intermediate results are sent to cell process a +1,6 ,c . 

Similarly, when a cell process a ,6 ,c with a =AA -1 finishes all its forward 
sweeps, it begins back substitutions for all j and k . When all the back substi- 
tutions reach i =1, the intermediate results are sent to the cell process with coor- 
dinates AA -2,6 ,c . In general, all cell processes a ,6 ,c with a j^AA -1, after 
completing their forward sweeps, wait until they receive intermediate results 
from cell processes a +1,6 ,e before they perform their portions of the back sub- 
stitutions and send their intermediate results to cell process a — 1,6 ,c , 

Since each node contians more that one cell process, all waiting must be 
structured to relinquish ihe node CPU. Typically, a cell process waits for data 
to arrive on a channel. The wait code is: 
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E*Era35*BE5 i**£T*3GZ'. .7Z3H.ZZ-, 


whHe(status(channel))flicM(); 

Note that cell processes finish sweeps for all j and k before sending intermedi- 
ate results. Another alternative would be to send intermediate results for each 
j and k . This strategy leads to much more communication overhead and would 
be intolerable on the Intel iPSC. In a system with an efficient and independent 
I/O processor this method might be effective in eliminating all waiting for inter- 
mediate results. 

8.2. Cell-To-Node Mappings 

In this section we discuss the gross effect of the cell-to-node mapping on the 
efficiency of our implementation. A reasonably complete discussion of these 
issues can be found in [CSS85] and [JSS85] where various implementations of the 
Alternating Direction method are presented and analyzed for the two- 
dimensional case. 

Suppose we have a 3-dimensional hypercube and AA = BD = CC = 2. In 
the Figure 8.1a we show an adjacency preserving 1-1 mapping of the cells into 
the nodes of the hypercube. It is easy to see that during the the course of the 
computation that at least half of the nodes will be idle at every point in time. 
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Next consider a 2-dimensional hypercube and a computational domain parti- 
tioned into 8 cells. The cell-to-node mapping in this case is shown in Figure 
8.1b. Since each node contains a cell with a =1, during the x_solve phase all 
nodes will have work to do and are initially active. Furthermore, all the nodes 
contain a cell with a = 2 so that all the nodes will remain active after the cells 
with a = 1 have finished their portion of the forward sweep. Of course, all the 
nodes will be idle during the time that the partial results from the forward sweep 
are transmitted from the cells with a = 1 to the cells with a = 2. The the 
situation is the same for the backward sweep. 

The y_8olve phase is not as favorable since only half of the nodes contain 
cells with b =1. Therefore at least half of the nodes are idle throughout the 
y solve phase. 

The z__solve phase is similar to the x_phase , since all the nodes contain 


cells with e = 1 and c =2. 
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These examples demonstrate the nature of the relationship between the 
cell-to-node mapping and the potential efficiency of our implementation. It fol- 
lows that the best case would be to have every node contain exactly one cell for 
each different value for a , b , and e and have adjacent cells map into adjacent 
nodes or the same node. This would mean that in each "solve" phase, every 
node would be busy except for the time during the transmission of intermediate 
results to neighboring cells. We have not achieved this condition in the previous 
example since node 0 contains two cells with 6 =1. It is easy to show that it is 
not possible to achieve such a mapping for any hypercube with dimension less 
than 6. It is also clear that the dimension of the hypercube would have to be 
even, say 2 d , and that AA = DB = CC = 2 d . It is not obvious whether such 
cell-to-node mappings exist. 

We might relax the above conditions by requiring that each node contain at 
least one cell for each different value for a , b , and c and that adjacent cells 
map into adjacent nodes. We can always find cell-to-node mappings which 
satisfy this condition. An example is shown in Figure 8.2. The difficulty with 
such mappings is that additional communication and cell processes are required. 
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In this case cc!!a which are adjacent in the z direction are mapped into nodes 
which are at a distance 2 from each other. All other adjacencies are preserved. 


8.3. Performance 

The code for the Beam and Warming method was run using the cell-to-node 
mapping shown in Figure 8.3 on a 16-node hypercube with 4 cell processes per 
node. We have II = JJ = KK = 5. This amounts to a domain with 20 3 grid 
points. The total number of floating point operations is approximately 214,000 
and the time per iteration (x _solve, y_solve, and z _solve) is about 2.5 seconds. 
This is a sustained rate of 5.350 floating point operations per second per node. 
The same code, specialized to a single node, was run on a VAXl 1/780 and we 
obtained a sustained rate of approximately 32,000 floating point operations per 
second. This is the same rate obtained for the Lax-Wendroff code on the VAX. 


TR-86,7 


- 57 - A 


March 18, 1986 






C-Z 




F|C-|M*£ Z 


Another possibility is to relax the constraint that adjacent cells map, into 
adjacent nodes. It can be shown that if we allow adjacent cells to map into 
nodes that are at most a distance 2 apart, then we can find a cell-to-node map- 
ping in which each node contains exactly one cell for each value of a , b , and c . 
In this case it is also clear that the dimension of the hypercube must be 
even(2d ) and AA = BB = CC = 2 d . Such a mapping is shown in Figure 8.3. 
In this case cells which are adjacent in the z direction are mapped into nodes 
which are at a distance 2 from each other. All other adjacencies are preserved. 
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8.3. Performance 

The code for the Beam and Warming method was run using the cell-to*node 
mapping shown in Figure 8.3 on a 16-node hypercube with 4 cell processes per 
node. We have // - JJ = KK - 5. This amounts to a domain with 2G 3 grid 
points. The total number of floating point operations is approximately 214,000 
and the time per iteration (x_solve, y_solve, and z_solve) is about 2.5 seconds. 

This is a sustained rate of 5,350 floating point operations per second per node. 

The same code, specialized to a single node, was run on a VAXll/780 and we 

( 

obtained a sustained rate of approximately 32,000 floating point operations per 
second. This is the same rate obtained for the Lax-Wendroff code on the VAX. 

We conclude that a 32-node cube is about 5.3 times faster than a VAX on this 
code. The performance for this code is only 30% of the performance of the Lax- 

* fr 

f 

Wendroff code. 

The hypercube performance is poor for the Beam and Warming code. One 
reason is that the tridiagonal systems do not require much floating point compu- 
tation. Implementation of the Navier Stokes equations will result in systems of 
block tridiagonal equations which require significantly more floating point opera- 
tions. This will tilt the balance away from communication and should result in 
improved performance. 

More extensive tests of this code should be carried out for different cell-to- 
node mappings. These results should be compared to massive data rearrange- 
ment strategies. 

I 
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Performance of the Intel IPSC Hypercube 

A Kolawa, S, Otto f 

Physics Dept,, Caltech, Pasadena CA 91125 


October 8, 1985 
Introduction: 

The purpose of this note is to present the speeds of the fundamental opera* 
tions used in the Intel Hypercube, IPSC, It is a companion to an earlier note 
HmlSS describing the timing of the Hark II hypercubes constructed at JPL 

Floating Point Speed 

First off, we give the floating point performance (single precision 32 bit) of 
a node, This was done by employing an accurate timing routine which runs 
independently in every node. 

Multiply 

The following code was timed : 
float a,b,c; 

for (i-Q; 1 < NumTimes; ++i) { 
a = b*e; 

I 
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Tha loop overhead was separately measured (see below) and subtracted, 

Intel 1PSC : 40,4 /isec /multiply or ,024 Mflops 

The same measurement done for a,b,e being double gives: 

Intel IPSC ; 43,5 /xaec /multiply or .023 Mflops. 

Add: 

The code: 
float a,b,c; 

for (1=0; i < NumTlmes; ++i) ( 
a = b + c; 

1 

loop overhead was again subtracted, We And: 

Intel IPSC : 39.5 add or .025 Mflops 
For a,b,c double it is: 

Intel IPSC : 43 ^s/ add or .023 Mflops. 

Loop Overhead 

Just the above loop was run and timed. 

Intel IPSC : B.2 /us/loop 

For more complex, realistic expressions, the apparent floating point perfor* 
mance increases in realistic codes (e.g,, lattice gauge). To illustrate this, we 
give a second measure of floating point speed. 


* Research supported by the Department of Energy grants DE-AS03-ER131 18.DE-F C03- 
8S&R29009 and by the Parsons Foundation and Systems Development Foundation. S.Otto 
holds a Bantrell Research Fellowship at Caltech. 
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Floating Point Parfonnanca #2 
The code executed: 
float a, b, c, d, e; 
for (l = 0; l< NumTlmei; ++l) ( 
a * b*c + b*e + d*c; 

The time to execute this was; 

Intel IPSC : 119,3 fa When a,b,c,d,e were doubie the execution time was: 

Intel IPSC : • 126.8 fa 

Giving as the performance figure (A floating pomt operation is now con- 
sidered as a or 

Intel IPSC : 23.86 fa / flop -> .042 Mflops 
For double performance we got: 

Intel IPSC : 25.36 fa /flop -> .039 Mflops 

Integer (16 bit) parfonnanca 

Multiply: 

The code: - 
short j, k, 1; 

for (i - 0; l< NumTlmes; ++i){ 
j = k*l; 

j 

giving: 

Intel IPSC : 4 fa / integer multiply 
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Add : 

The code: short j, k, 1; 

for (l * 0; l< NumTimen, ++l)| 

J = k + 1; 

I 

giving: 

Intel 1PSC : 2 m* / integer add 

Intemode Communications: 
angle Packet: 

The objective is to measure the speed of the fundamental communications 
routines, wtELT/rdELT. The Hypercube was mapped to a ring and each node 
along the ring transferred a single, 64 bit packet one step forward in the ring, 
This is the sort of thing which happens in many codes; each node is both sending 
and receiving data. The code executed was: 
int data [4]; 

* for (l = 0; l< NumTlmes; ++i)J 

wtELT (data, forward chan); 
rdELT (data, backward chan); 

I 

The timings per single, 64 packet transfer are: 

Intel IPSC : 11920 fi s / single packet transfer 

This gives us a for single precision arithmetic by dividing these 

times by two, since t nmm is conventionally defined as the transfer time of a 32 
bit word. Note that f evmm is the time both to write and read a 32 bit word * this 
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Is what normally occurs • In homogeneous applications at least. (This definition 
is different from that In reference l.) 

Therefore, the appropriate to the usual efficiency analyses is: 

= 5960 (a Intel IPSC 

We can also relate "f (fnun " to "t/if", defined as the time to do a single float* 
lag point operation (32 bit), "t/np” has also been called "i,*/ 1 ; In other 

Hm memos, For the Intel IPSC machine, we have: 

ffimm s 149 * f/i»f 

f e#mm for double precision work is achieved by doubling the above and all follow* 
ing t etmm estimates, 

Global Communications ("reesig, eandaig ) 

The global broadcast utility, recsig, was timed. If N is the number of nodes 
In the Hypercube, the timings are of the form: 

a -b/3 (log N + 1) 


where a reresents a constant startup tune, /3 represents the communication 
time through each of the log N stages of the broadcast, and the +i is there 
because the corner node must first read from the IH. Results are; 


Dimension 

Intel 

of Cube 
1 ’ 

9500/43 

2 

12000 

3 

17000 

4 

22000 

5 

27500 


These timings do not take into account the operating system overhead, 
waiting, etc. . 

The timings, fit the theoretical form given in the above quite well; the 
parameters are; 









Bock Transfer. "Shift" 

We measured the block transfer of data between two neighboring nodes and 
compare result with Markll(SMHz) Caltech Hypercube [21. 


Number of Packets 

P«r packet 

tctmm per packet 

in the Shift 

Intel 

Mar kll( 5MHz) 

1 

5960iis 

125ms 

2 

3007 

93 

4 

1510 

76 

a 

777 

68 

18 

390 

64 

32 

202 

62 

64 

110 

61 

12B 

65 

60 

129 

102 

60 

132 

100 

60 

138 

97 

60 

144 

93 

60 

160 

86 

60 

192 

75 

60 

256 

32.5 

60 

257 

SO 

60 

260 

79.1 

60 

272 

77 

60 

268 

75.5 

60 

320 

69 

60 

384 

60 

60 

385 

73 

60 

418 

71 

60 

448 

67 

60 

512 

31.6 

60 


The above results are plot on figure 1. They show that when message length 
is equal to multiplicity of lkbyte then communication time per packet is 
minimal . The next byte cause the jump of communication time which then 
starts to decrease until the length of the message is again the muitiplicity of 
lkbyte. This behavior is caused by the operating system which tends to sends 
messages in lkbyte pieces. It seems that asymptotic communication time per 
64 bit packet is 58 ~ 60 fjs. 
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Figure Captions: 

1) Plot of logarithm of communication time per packet vs logarithm of base 2 
of number of 84 oits packets in the message. The solid line presents com- 
munication time per packet for Intel IPSC. The dashed lines represent 
communication time per packet for the Interrupt Driven Operating System 
(IDOS) and the Crystalline Operating System (CrOS) on Mark II 5MHz and 
BMHz Caltech/JPL Hypercubes . The dashed area presents range of change 
of floating point performances on Intel IPSC and MarkII(8MHz) machines.The 
"scalar" is floating point multiply and "vector" is an operation like that in 
section Floating Point Performance #2. 





